Decomposition of Response Time to Give Better 
Prediction of Children’s Reading Comprehension 


Zhila Aghajari * 
Lehigh University 
Bethlehem, PA 


Deniz Sonmez Unal * 
University of Pittsburgh 
Pittsburgh, PA 


Mesut Erhan Unal 
University of Pittsburgh 
Pittsburgh, PA 


zha219@lehigh.edu des204@pitt.edu meu6@pitt.edu 


Ligia Gómez 
Arizona State University 
Tempe, AZ 
ligia.gomez@asu.edu 


Erin Walker 
University of Pittsburgh 


ABSTRACT 


Response time has been used as an important predictor of 
student performance in various models. Much of this work is 
based on the hypothesis that if students respond to a prob- 
lem step too quickly or too slowly, they are most likely to be 
unsuccessful in that step. However, something that is less 
explored is that students may cycle through different states 
within a single response time and the time spent in those 
states may have separate effects on students’ performance. 
The core hypothesis of this work is that identifying the dif- 
ferent states and estimating how much time is devoted to 
them in a single response time period will help us predict 
student performance more accurately. In this work, we de- 
compose response time into meaningful subcategories that 
can be indicative of helpful or harmful cognitive states. We 
then show how a model that is using these subcategories as 
predictors instead of response time as a whole outperforms 
both a linear and a non-linear baseline model. 


Keywords 
Response time, student modeling, regression models, on- 
task and off-task behaviors 


1. INTRODUCTION 


Intelligent Tutoring Systems (ITS) help students learn a 
wide variety of skills from problem solving [23] to reading 
[22, 19]. To improve ITS designs, researchers often study 
students’ learning patterns to identify their relationship to 
performance and target them for intervention. Within this 
context, response time has been widely used to predict stu- 
dent performance [39, 40] and to interpret cognitive and 
motivational states during ITS use [39, 3, 7]. 


Much of the research involving response time is based on 
the hypothesis that the relationship between response time 
and student performance is non-linear [9]. Fast or slow re- 
sponse times may be indicative of both helpful and harmful 
cognitive states. For example, a fast response time could be 
a result of either mastery of a skill or guessing. Likewise, 
a long response time could be because of struggling or be- 
ing off-task. Contextual information surrounding response 
time is often used in identifying the correct cognitive or mo- 
tivational states. For example, a long response time after 
reading a bug or a hint message can be linked to reflection 
[32], whereas a short response time after such actions can be 
a sign of gaming the system [4]. Thus, previous literature 
has focused on identifying students’ cognitive states based 
on sequences of actions and the time spent between them 
[4, 2, 7, 5]. However, students may go through different 
cognitive states even within a time period between consecu- 
tive actions [32]. Despite a large body of research dedicated 
to studying students’ cognitive states, little is known about 
the different states a student might be in during a single re- 
sponse time and how time spent in those states would affect 
learning. 


We hypothesize that response time can be divided into sub- 
categories that can be indicative of some helpful and harm- 
ful cognitive states, and that identifying time spent on these 
states within one response time can improve student perfor- 
mance prediction. In our previous work [35], we divided re- 
sponse times during a reading comprehension task into two: 
reading and thinking time. Results of a piecewise regres- 
sion model revealed that thinking time could include four 
states: gaming, productive thinking, wheel spinning, and 
mind wandering. With the insight from these results, we 
further investigate the different states that could occur in 
one response time. We compare a model that is based on 
decomposition of response time to a linear baseline model 
which only uses average response time, and also to a non- 
linear baseline (a piecewise regression). By decomposing the 
response time, we show that students can go through multi- 
ple cognitive states in between log events. We also show that 
by identifying how much time is devoted to these states, we 
can improve the predictive models of student performance. 


2. RELATED WORK 
2.1 Cognitive and Motivational States 
Within this section we review the cognitive states that are 
related to productive thinking, gaming the system, and un- 
productive thinking. We extract these states from the broader 
research literature, although it should be noted that these 
states were also identified by teachers as important [13]. 


We gather learning events that are associated with robust 
learning under productive thinking behaviors. [17] di- 
vided these events into three categories: understanding and 
sense-making processes, induction and refinement processes, 
memory and fluency-related processes. Some example be- 
haviors that fall under understanding and sense-making pro- 
cesses and induction and refinement processes that could be 
relevant in a reading domain are self-explanation and self- 
reflection. These behaviors are shown to be positively re- 
lated to learning [32, 8]. 


Gaming the system is an undesirable cognitive state wherein 
students try to reach the correct answers and advance in the 
lesson by systematically misusing the features of the system 
[3]. It is linked to short response times and rapid actions [3]. 
[29] divides gaming into two main types: systematic guess- 
ing and help abuse. Systematic guessing could be inferred 
from short response times between step attempts [12, 28, 2], 
entering the same answer in multiple contexts, and entering 
similar answers [29]. Help abuse was defined as searching 
for bottom-out hints, asking for help without any reflection 
on the help, and entering multiple incorrect answers despite 
receiving help. 


Within unproductive thinking states, we review wheel 
spinning and mind wandering. Wheel spinning occurs 
when the student makes an effort but does not succeed. It 
is linked to long response times and many help requests. [7] 
illustrates that if students need help solving the first twenty 
problems, they are in a wheel spinning phase, and presenting 
them with more problems will not be helpful. [5] showed 
wheel spinning is negatively correlated with flow, positively 
correlated with gaming and confusion, and not correlated 
with boredom. Mind wandering occurs when students in- 
voluntarily shift their attention to task-unrelated thoughts 
[15, 14, 34], and is associated with distraction or boredom. 
This cognitive state occurs 20-40% of the time during read- 
ing [30] and causes students to fail in gaining reading com- 
prehension skills [33, 36]. As mind wandering occurs in- 
voluntarily, it is very difficult to measure, and it is often 
measured using self-reported approaches [24]. 


2.2 Response Time in Student Modeling 
Response time has been widely used in different kinds of stu- 
dent models, and can improve the accuracy of those models 
[39, 20]. For example, [4] presents a model that uses response 
time to detect shallow learning, and [11] predicts student 
performance in transfer learning using response time. [6] 
developed an item response theory (IRT) model to show an 
overall level of student engagement by analyzing response 
times, problem difficulty, and correctness of responses. [16] 
also presents an IRT-based model to estimate student profi- 
ciency and motivation level where motivation was measured 
based on time spent between actions and a short response 
time was an indicator of unmotivated behavior. 


In this paper, we are inspired by work that centers response 
time as a non-linear predictor of students’ performance. [9] 
suggests that the relationship between time and student suc- 
cess is not linear, and there is an ideal range of time for stu- 
dents to respond to a problem. In [10], they further support 
this non-linear relationship by showing that including time 
as a quadratic predictor instead of a linear one yields a better 
prediction of students’ performance. These studies support 
the intuition that accounting for the activities in different 
ranges of response time can give a better prediction of stu- 
dent performance. 


Other efforts have shown success in estimating time spent 
on the activities that occur within a single response time. 
These efforts involved decomposing response time. [32] pre- 
sented a model that predicts student performance relying 
on estimation of activities that cannot be directly observed 
from the log data such as thinking about hints, entering an 
answer, and reflecting on the hints. The preliminary results 
of our previous work that decomposes response times in a 
reading comprehension domain also revealed that students 
may go through multiple cognitive states during a single re- 
sponse time period [35]. 


In this work, we aim to show that identifying time spent 
on different cognitive states within the response time will 
provide better predictions of student performance. 


3. CORPUS AND MEASURES 

The datasets used in our work are log data collected during 
two studies with an iPad application called EMBRACE [38]. 
EMBRACE is designed to help young dual-language learn- 
ers improve their reading comprehension in English. The 
students read interactive story books divided into chapters 
and they answer 3 to 9 multiple choice questions about the 
text at the end of each chapter. Books consisted either of 
narrative stories or of informational texts. Students see the 
text they should read in a box and they press a button la- 
beled “Next” at the bottom of the screen to move from one 
sentence to another. They also see images representing what 
is depicted by the text. 


In the full versions of EMBRACE, students are asked to 
either imagine the highlighted sentences or move the im- 
ages on the screen to enact these sentences. They can get 
feedback based on how they are moving the images. Some 
features that are in the full versions of EMBRACE are not 
provided in the control version. Students still see the images 
in this version as well as the highlighted sentences, however, 
the only actions that they can perform are tapping on words 
to hear their pronunciations, and pressing the “Next” button 
to move to the next sentence. In this work, we are particu- 
larly interested in the control version as it gives us a more 
restricted set of student actions, which better enables us to 
focus on the role response time plays in reading comprehen- 
sion. In the control version, we use the following measures: 


1. Student performance: The proportion of correctly 
answered questions at the end of the chapters. 


2. Response time: The time spent between when the 
sentence is first loaded and when the student presses 
the ‘Next’ button to proceed to the next sentence. 


3. Help requests: The frequency of tapping on an un- 
derlined word to hear its pronunciation in a sentence. 


Figure 1: The protocol of designing productive vs. unproductive 
thinking portions in students' learning. Response time initially 
includes time spent on gaming, reading, and thinking. Step 1: 
gaming time is calculated and subtracted from response time. 
Step 2: reading time is calculated and subtracted from the 
remaining portion of response time. Step 3: productive and 
unproductive thinking times are distinguished. 


4. Frequency of gaming: The “Next” button is dis- 
abled for 1 to 3 seconds depending on the length of 
the sentence to encourage the students to read the sen- 
tence completely. However, students might try to skip 
the current sentence and press this button while it is 
disabled. Frequency of this behavior within a sentence 
is our indicator of gaming since systematic and rapid 
actions to advance in the curriculum have been identified 
as gaming in previous research [3]. Note that this 
metric is not available in the second dataset. 


5. Decoding ability: Decoding is defined as the ability 
to correctly pronounce written words. Our decoding 
measure is the student’s score on the decoding part 
of the Qualitative Reading Inventory (QRI) [18], which 
ranges from 0 to 40. 


6. Sentence difficulty: We used the Flesch-Kincaid read- 
ability grade level (FK) [21] to measure sentence dif- 
ficulty. It is based on number of words in the sen- 
tence, and syllables in words. This measure represents 
the grade level required to understand a certain text. 
The difficulty of each chapter is calculated based on 
the average difficulty of the sentences in the chapter. 
FK is often used for long texts rather than single sen- 
tences. To confirm this measure is also appropriate 
for computing sentence level difficulty, we also com- 
puted chapter difficulty by applying FK on complete 
chapter texts. We did not observe a noticeable difference 
between averaging sentence-level difficulties per chapter 
(M = 4.81, SD = 1.39) and applying FK to complete chapter 
texts (M = 4.64, SD = 1.35); the RMSE between the two 
was .37. 
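
As a rough illustration of the sentence difficulty measure, the sketch below applies the standard Flesch-Kincaid grade-level formula to single sentences and averages the result per chapter. The function names and the use of precomputed word and syllable counts are assumptions made for illustration; the text does not say how syllables were counted.

```python
from statistics import mean


def fk_grade(n_words: int, n_sentences: int, n_syllables: int) -> float:
    """Flesch-Kincaid grade level computed from raw counts."""
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59


def chapter_difficulty(sentence_counts: list[tuple[int, int]]) -> float:
    """Average sentence-level FK grade for one chapter.

    sentence_counts holds one (word_count, syllable_count) pair per sentence
    (a hypothetical input format).
    """
    return mean(fk_grade(words, 1, syllables) for words, syllables in sentence_counts)


if __name__ == "__main__":
    # Two made-up sentences: 8 words / 10 syllables and 12 words / 18 syllables.
    print(round(chapter_difficulty([(8, 10), (12, 18)]), 2))
```

Treating every sentence as a one-sentence text is what makes the formula usable at the sentence level; the chapter value is then just the mean of those grades.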


In our datasets, data points are distinct student-chapter 
pairs as student performance can only be calculated in chap- 
ter level. The first dataset includes 22 students who are 
native Spanish speakers from second to fourth grade with 
mean QRI score 34.71 (SD = 5.19). One student is ex- 
cluded from the dataset due to having scored less than 50% 
on the QRI test, and thus being unable to effectively use 
the application. We also excluded the first chapters of the 
books that were read out loud to the student by the applica- 
tion. Finally, some of the student-chapter pairs are excluded 
from the dataset due to logging errors such as unrealistic re- 
sponse times or not completing the chapter. In total, we 
had data from 21 students, and 716 distinct student-chapter 
pairs. The mean number of book chapters per student is 
34.09 (SD = 2.3) with mean sentence difficulty 4.82 (SD = 
2.84) across 7 story books. 


Table 1: Descriptive statistics of time (in seconds) 
subcategories across student-chapter pairs in the 
first dataset (Spanish) 


Measurements              Mean    SD     Min    Max 
Reading Time              6.04    1.67   1.11   11.44 
Productive Thinking       2.18    1.72   0      7.84 
Unproductive Thinking     1.68    4.01   0      39.47 
Gaming Time               0.21    0.44   0      5.36 
Time Spent on Help        0.15    0.09   0.07   0.33 
Time Spent on Sentence    10.12   6.02   2.75   51.5 


Table 2: Descriptive statistics of time (in seconds) 
subcategories across student-chapter pairs in the 
second dataset (Mandarin) 


Measurements              Mean    SD     Min    Max 
Reading Time              4.40    0.65   2.67   6.12 
Productive Thinking       2.68    1.07   0.01   4.09 
Unproductive Thinking     1.45    3.08   0      27.79 
Gaming Time               0.08    0.38   0      4.09 
Time Spent on Help        0.84    0.60   0.06   3.61 
Time Spent on Sentence    9.17    4.29   2.81   39.56 


In the second dataset, collected from an earlier experiment, 
we had 24 native Mandarin speaker students from seventh to 
ninth grade with mean QRI score 37.42 (SD = 1.79). Only 
one student-chapter pair was excluded from the dataset as 
the student in that pair did not complete the assessment 
task for the chapter. In this dataset we had 479 distinct 
student-chapter pairs. The mean number of book chapters 
per student is 19.95 (SD = 0.20) with mean sentence diffi- 
culty 4.14 (SD = 1.06) across 4 story books. 


4. RESPONSE TIME DECOMPOSITION 


Figure 1 visualizes how we decompose response time at a 
high level. In the following subsections we describe how each 
time subcategory was computed in detail. The descriptive 
statistics of the time subcategories for the datasets are given 
in Tables 1 and 2. 


4.1 Time Spent on Gaming 
For each sentence, if the student never pressed “Next” when 
it was disabled, the gaming time on that sentence is 0. Oth- 
erwise, we first calculate how long the student waited after 
the last time they pressed “Next” when it was disabled until 
they actually passed the sentence. We calculated gaming 
time by subtracting this waiting time from total time spent 
on sentence. A student who waited for a long time to pass 
the sentence after pressing “Next” when it was disabled will 
have a low gaming time estimate. Note that, in the second 
dataset, since our gaming indicator was not available, we 
did not include gaming time in our analyses. 
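
The gaming time rule can be sketched as follows. This is a minimal illustration that assumes each sentence's log provides a load timestamp, the timestamp at which the student finally passed the sentence, and the timestamps of any presses on the disabled button; these field names are hypothetical.

```python
def gaming_time(loaded_at: float, passed_at: float, disabled_presses: list[float]) -> float:
    """Seconds attributed to gaming on one sentence.

    All arguments are timestamps in seconds (hypothetical log fields).
    """
    if not disabled_presses:
        # The student never pressed "Next" while it was disabled.
        return 0.0
    total = passed_at - loaded_at
    # Waiting time: from the last disabled press until the sentence was passed.
    waiting = passed_at - max(disabled_presses)
    return total - waiting


if __name__ == "__main__":
    # Sentence loaded at t=0, disabled presses at 1.0 s and 2.5 s, passed at 10 s.
    print(gaming_time(0.0, 10.0, [1.0, 2.5]))
```

Under this reading, gaming time equals the span from sentence load to the last disabled press, so a long wait after that press lowers the estimate.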


4.2 Time Spent on Reading 

We first estimated how many words students should read per 
minute based on their grade according to [25]. For example, 
if a student is in third grade, they should be able to read 
between 120 and 170 words per minute. To give a more specific 
estimate for reading rate, instead of using only the student’s 
grade, we include their ability in decoding English words and 
sentence difficulty, as students with a higher decoding abil- 
ity may read more words. Similarly, in more difficult texts, 
students may read fewer words per minute. We first divide 
the normalized decoding score by the normalized sentence 
difficulty for each student and sentence pair. Let [a,b] be the 
interval representing the possible values of this measure. We 
create another interval [c, d], by getting the possible values 
of how many words students should read per minute within 
our students’ grade levels from [25]. We simply map inter- 
val [a,b] on interval [c,d] using the linear mapping formula 
below: 


$$ f(x) = c + \frac{d - c}{b - a}\,(x - a) \qquad (1) $$ 

Here, $x$ is one specific decoding/difficulty score for a 
student-sentence pair and $f(x)$ will give an estimate for how many 
words this student should read adjusted by the student’s 
decoding ability and the difficulty of the sentence. Then, we 
simply calculated the time spent on reading for each sentence 
based on the student’s reading rate and the word count in 
the sentence. For example, if a student is estimated to read 
120 words per minute, their reading time estimate for a 6- 
word sentence is 3 seconds. 


$$ T_{\mathrm{read}_{u,s}} = \frac{S_w}{U_{\mathrm{wpm}_s}} \times 60\,\mathrm{s} \qquad (2) $$ 


Here $T_{\mathrm{read}_{u,s}}$ denotes the time estimate for student $u$ to read 
sentence $s$, $S_w$ denotes the number of words in sentence $s$, 
and $U_{\mathrm{wpm}_s}$ denotes the rough estimate of the reading rate 
for student $u$ while reading sentence $s$. 
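
Equations (1) and (2) can be sketched together as below. The function and parameter names are assumptions, and the interval bounds in the usage lines are illustrative only; the text does not report the actual values of a and b.

```python
def map_interval(x: float, a: float, b: float, c: float, d: float) -> float:
    """Equation (1): linearly map x from [a, b] onto [c, d]."""
    return c + (d - c) / (b - a) * (x - a)


def reading_time_seconds(n_words: int, words_per_minute: float) -> float:
    """Equation (2): estimated seconds needed to read a sentence."""
    return n_words / words_per_minute * 60.0


if __name__ == "__main__":
    # Hypothetical decoding/difficulty score of 0.4 on an assumed [0, 1] scale,
    # mapped onto the third-grade range of 120 to 170 words per minute.
    rate = map_interval(0.4, 0.0, 1.0, 120.0, 170.0)
    print(rate, reading_time_seconds(6, rate))
```

With a rate of exactly 120 words per minute, a 6-word sentence yields the 3 seconds given in the worked example in the text.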


4.3 Time Spent on Help Requests 

We computed the exact time it takes to play the help au- 
dios. Then we computed the time spent on help requests by 
multiplying the time it takes to play the tapped words by 
two as each word is played twice. 


4.4 Time Spent on Thinking 
Finally, we calculate the thinking time by simply subtracting 
gaming time, reading time and time spent on help requests 
from total time spent on one sentence. 

$$ T_{\mathrm{think}_{u,s}} = T_{\mathrm{total}_{u,s}} - \left( T_{\mathrm{game}_{u,s}} + T_{\mathrm{read}_{u,s}} + T_{\mathrm{help}_{u,s}} \right) \qquad (3) $$ 


Figure 2: Example student-level thresholds for a 
high decoding (left) and a low decoding student 
(right). 


Following this procedure, thinking time was estimated to 
be negative for 34% of the data points because the reading 
time estimate was higher than the total time spent on the 
sentence. In that case, we adjust the reading time estimate 
so that it equals the time spent on the sentence and set 
thinking time to 0, which means that the time spent on the 
sentence was devoted to reading and/or gaming. Even though 
zeroing out negative thinking times might seem to remove 
variance that could be indicative of student performance, 
we did not observe any difference in terms of model 
performance. Moreover, doing so made the thinking time 
estimates more interpretable. 
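
A minimal sketch of Equation (3) with the zeroing-out rule is given below. The names are hypothetical. One detail is an assumption: when thinking time would be negative, the reading estimate is shrunk to whatever remains after gaming and help time, so that the parts still sum to the total. The text only says that reading time is adjusted to match the time spent on the sentence.

```python
from dataclasses import dataclass


@dataclass
class SentenceTimes:
    gaming: float
    reading: float
    help: float
    thinking: float


def decompose_sentence_time(total: float, gaming: float, reading_estimate: float,
                            help_audio_durations: list[float]) -> SentenceTimes:
    """Split the total time on one sentence into the four subcategories."""
    # Each tapped word is played twice, so help time is twice the audio length.
    help_time = 2.0 * sum(help_audio_durations)
    thinking = total - (gaming + reading_estimate + help_time)
    reading = reading_estimate
    if thinking < 0:
        # Assumption: cap reading at the time left after gaming and help.
        reading = max(total - gaming - help_time, 0.0)
        thinking = 0.0
    return SentenceTimes(gaming, reading, help_time, thinking)


if __name__ == "__main__":
    print(decompose_sentence_time(10.0, 1.0, 4.0, [0.5]))
    print(decompose_sentence_time(5.0, 0.0, 6.0, []))
```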


4.5 Distinguishing Between Productive and Unproductive Thinking Time 

To distinguish between productive and unproductive think- 
ing, we use a data-driven method to find a threshold in 
thinking time for a student and chapter where spending 
more time on thinking after passing that threshold will be 
unhelpful. We first estimate that threshold at the student 
level and then similarly at the chapter level. We then com- 
bine the two thresholds to estimate one threshold for each 
student-chapter pair. 


To find student level thresholds, using the segmented func- 
tion in R [26, 27], we fit a separate piecewise regression 
model with our performance measure as the outcome and 
the mean time spent on thinking as the predictor for each 
student (R² = 0.24). There will be one breakpoint in thinking 
time for each student which will be independent of the 
chapter. Similarly, to estimate the thresholds at the chapter 
level, we fit one piecewise regression model with the 
performance measure as the outcome and the mean time spent on 
thinking as the predictor for each chapter (across all 
students) (R² = 0.23). The breakpoints represent the thresholds 
distinguishing between productive and unproductive 
thinking times for chapters. Figure 2 shows example thresh- 
olds returned from the piecewise regression models for a high 
decoding and a low decoding student, and Figure 3 shows 
example thresholds for an easy and a difficult chapter. High 
and low decoding students and easy and difficult chapters 
were decided based on median splits. 
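
To make the breakpoint idea concrete, the sketch below fits a continuous two-segment line for every candidate breakpoint and keeps the one with the lowest squared error. This is a swapped-in technique for illustration: it is a plain grid search in standard-library Python, not the estimation procedure of the `segmented` function in R, and all names are hypothetical.

```python
def solve_linear(A: list[list[float]], b: list[float]) -> list[float]:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def hinge_sse(xs: list[float], ys: list[float], bp: float) -> float:
    """Squared error of y ~ b0 + b1*x + b2*max(0, x - bp), fit by least squares."""
    rows = [[1.0, x, max(0.0, x - bp)] for x in xs]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    b = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    coef = solve_linear(A, b)
    return sum((y - sum(c * v for c, v in zip(coef, r))) ** 2 for r, y in zip(rows, ys))


def find_breakpoint(xs: list[float], ys: list[float]) -> float:
    """Return the interior observed x value that minimises the squared error."""
    candidates = sorted(set(xs))[1:-1]  # exclude the smallest and largest x
    if not candidates:
        raise ValueError("need at least three distinct x values")
    return min(candidates, key=lambda bp: hinge_sse(xs, ys, bp))


if __name__ == "__main__":
    thinking = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # made-up mean thinking times
    performance = [0.0, 1.0, 2.0, 3.0, 2.5, 2.0, 1.5]  # made-up outcome with a kink
    print(find_breakpoint(thinking, performance))
```

Restricting candidates to interior observed values keeps the least-squares system well defined, because each candidate then has data on both sides of it.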


Figure 3: Example chapter-level thresholds for an 
easy chapter (left) and a difficult chapter (right). 


Figure 4: QRI score and estimated student-level thresholds (left), 
and chapter difficulty and estimated chapter-level 
thresholds (right). 


Although the separate thresholds we found are reasonable 
estimates, we do not use them directly when deriving the 
time spent on productive thinking, for two reasons. First, 
the threshold between productive and unproductive thinking 
time should be adjusted to both student and chapter 
characteristics in the same way that we adjusted time spent 
on reading. Second, we estimated these points by building 
a model which predicts student performance, the variable 
that we would like to predict. Time spent on productive 
and unproductive thinking will be used as predictors in 
the model that we propose in this work. Leaking information 
from our outcome measure while extracting these features 
may cause overfitting in the final model. Therefore, we 
combine thresholds predicted by separate regression 
equations rather than using the piecewise estimates 
directly. We build two separate linear regression models to 
predict the ‘true’ student-level and chapter-level 
thresholds. Then we combine the two equations by taking 
their weighted average. This allows us to have one 
threshold estimate for an arbitrary student-chapter pair 
based on the decoding ability of the student and the 
difficulty of the chapter. 


Figure 4 shows the relationship between QRI and ‘true’ 
student-level thresholds, and the relationship between chap- 
ter difficulty and ‘true’ chapter level thresholds. As seen in 
the figure, the estimated student-level thresholds are negatively 
correlated with decoding ability. This indicates that 
the threshold that segregates productive and unproductive 
thinking regions occurs earlier for the students who scored 
better in the decoding test. The same figure also shows that chapter-level 
thresholds and chapter difficulties are far less correlated, 
which suggests that in estimating a threshold based on both 
student and chapter characteristics, the student character- 
istic (decoding score) is more important than the chapter 
characteristic (difficulty). 


To combine these thresholds, we first find the equation for 
chapter difficulty and thinking time thresholds. 


$$ \lambda_{\mathrm{chapter}} = B_0 + B_1 \cdot \mathrm{DIFF}_{\mathrm{chapter}} \qquad (4) $$ 


Here $\mathrm{DIFF}_{\mathrm{chapter}}$ denotes the chapter difficulty, and $B_1$ and 
$B_0$ are the estimated slope and y-intercept of the linear 
equation respectively. Similarly, we learn an equation for 
the thinking time threshold at the student level as follows: 


$$ \lambda_{\mathrm{student}} = C_0 + C_1 \cdot \mathrm{QRI}_{\mathrm{student}} \qquad (5) $$ 


where $\mathrm{QRI}_{\mathrm{student}}$ denotes the student’s decoding score, and 
$C_1$ and $C_0$ denote the estimated slope and y-intercept of this 
linear equation respectively. We combine these two separate 
thresholds by taking the weighted average of them. We 
weigh the equations by the correlation coefficient between 
chapter difficulty and the estimated chapter-level thresholds 
($x$), and the correlation coefficient between the QRI score 
and the estimated student-level thresholds ($y$). We find the 
combined threshold as follows: 


$$ \lambda_{\mathrm{combined}} = \frac{x \cdot \lambda_{\mathrm{chapter}} + y \cdot \lambda_{\mathrm{student}}}{|x| + |y|} \qquad (6) $$ 


where $x = 0.05$ and $y = -0.39$. Using this equation, we have 
one estimate for thinking time threshold for a given student 
and chapter based on both decoding ability of the student 
and chapter difficulty. 
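
The combination step can be sketched as follows. Two points are assumptions rather than details confirmed by the text. First, the regression coefficients are not reported, so the intercepts and slopes in the usage lines are invented. Second, the sketch uses the absolute values of both correlations as weights, which keeps the result between the two thresholds; the combined-threshold equation may instead use the signed values in its numerator.

```python
def linear_threshold(intercept: float, slope: float, value: float) -> float:
    """Threshold predicted by one of the two linear equations."""
    return intercept + slope * value


def combined_threshold(chapter_threshold: float, student_threshold: float,
                       x: float = 0.05, y: float = -0.39) -> float:
    """Weighted average of the chapter-level and student-level thresholds.

    x and y are the reported correlations. Using their absolute values as
    weights is an assumption of this sketch.
    """
    wx, wy = abs(x), abs(y)
    return (wx * chapter_threshold + wy * student_threshold) / (wx + wy)


if __name__ == "__main__":
    # Invented coefficients, for illustration only.
    chapter_thr = linear_threshold(3.0, 0.1, 4.8)    # chapter difficulty 4.8
    student_thr = linear_threshold(9.0, -0.15, 35.0)  # QRI score 35
    print(round(combined_threshold(chapter_thr, student_thr), 3))
```

Because the student-level correlation is much stronger than the chapter-level one, the combined value sits close to the student-level threshold.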


Finally, productive thinking time is defined as time spent on 
thinking before this threshold. If the time spent on thinking 
is less than this threshold for a given student and chapter 
pair, all thinking was productive and time spent on unpro- 
ductive thinking is 0. If the time spent on thinking is larger 
than the threshold, time spent on thinking until the thresh- 
old will be counted as productive thinking time and any 
time beyond the threshold will be counted as unproductive 
thinking time. 
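
The split described above reduces to a clamp at the threshold. The sketch below is a minimal illustration with hypothetical names; it assumes a non-negative thinking time.

```python
def split_thinking(thinking: float, threshold: float) -> tuple[float, float]:
    """Return (productive, unproductive) thinking time in seconds."""
    # Thinking up to the threshold is productive; the rest is unproductive.
    productive = max(0.0, min(thinking, threshold))
    unproductive = thinking - productive
    return productive, unproductive


if __name__ == "__main__":
    print(split_thinking(3.0, 5.0))  # below the threshold
    print(split_thinking(7.5, 5.0))  # above the threshold
```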


5. PREDICTING COMPREHENSION 


The core hypothesis in our work is that dividing response 
time into subcategories in a way that could be indicative 
of some helpful and harmful cognitive states will improve 
predictive models of student performance. To test this 
hypothesis, we compared the proposed linear model 
(Decomposed RT) to two baselines: one that uses response 
time as a whole (Baseline 1), and another that uses response 
time as a non-linear predictor (Baseline 2). The second 
baseline lets us show that we are not simply accounting for 
non-linearity in response time, but that identifying the 
states within response time helps us predict comprehension 
more accurately. We report AIC [1] and BIC [31] to show 
that the improvement in the model is not because of the 
increased number of predictors. Table 3 summarizes the 
feature sets we used in the 3 models we compare. We 
performed cross-validation at the student level for 50 
iterations. In each iteration, we left out a unique student 
pair from the whole procedure (decomposition of response 
time and training the models) and used their data for 
testing. 
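
One way to generate such splits is sketched below. It reads "a unique student pair" as two distinct students held out together, which is an interpretation rather than something the text states; the function name, the seed, and the random choice of pairs are likewise assumptions.

```python
import random
from itertools import combinations


def student_pair_folds(student_ids, n_iterations: int = 50, seed: int = 0):
    """Yield (train_students, held_out_pair) for each iteration.

    Every held-out pair is distinct. Decomposition and model fitting would
    both be rerun on the training students only.
    """
    ids = list(student_ids)
    pairs = list(combinations(sorted(ids), 2))
    if n_iterations > len(pairs):
        raise ValueError("not enough distinct student pairs")
    rng = random.Random(seed)
    for held_out in rng.sample(pairs, n_iterations):
        train = [s for s in ids if s not in held_out]
        yield train, list(held_out)


if __name__ == "__main__":
    for train, test in student_pair_folds(range(21), n_iterations=3):
        print(test, len(train))
```

Holding out whole students, rather than individual student-chapter pairs, keeps a student's data from appearing in both training and test sets.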


Table 4 shows the average RMSE, R², AIC and BIC values 
of the 50 iterations. For both datasets, we randomly flip 
the sign of the difference between paired model outcomes to 
conduct a paired-sample permutation test [37] to compare 
the mean of differences in evaluation metrics between our 
model and each baseline. We performed 1000 permutation 
trials in total. For the first dataset, we found significant 
improvements against both baselines in RMSE (p < 0.005), 
in AIC (p < 0.001), and in BIC (p < 0.001). 
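
The sign-flipping procedure can be sketched as below. It is a minimal illustration with hypothetical names. The two-sided comparison and the add-one correction in the p-value are assumptions, since the text does not specify either.

```python
import random


def sign_flip_test(differences: list[float], n_trials: int = 1000, seed: int = 0) -> float:
    """Paired-sample permutation test on the mean of paired differences.

    differences holds one value per fold, e.g. baseline RMSE minus model RMSE.
    Returns a two-sided p-value.
    """
    rng = random.Random(seed)
    n = len(differences)
    observed = abs(sum(differences) / n)
    extreme = 0
    for _ in range(n_trials):
        # Randomly flip the sign of each paired difference.
        flipped = sum(d if rng.random() < 0.5 else -d for d in differences)
        if abs(flipped / n) >= observed:
            extreme += 1
    # Add-one correction so the p-value is never exactly zero.
    return (extreme + 1) / (n_trials + 1)


if __name__ == "__main__":
    made_up_differences = [0.01, 0.02, 0.005, 0.015, 0.012, 0.008, 0.02, 0.011]
    print(sign_flip_test(made_up_differences))
```

Under the null hypothesis that the two models perform equally, each paired difference is as likely to be positive as negative, which is what justifies the random sign flips.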
Table 3: Feature sets used in the models: Decomposed RT, Baseline 1, and Baseline 2. † indicates being a 
significant predictor (p < 0.05) of student performance in more than 80% of the folds. 


Decomposed RT (Linear regression) 
    Frequency of gaming ¹ 
    Frequency of help requests 
    Chapter difficulty † 
    Student decoding score † 
    Time spent on reading † 
    Time spent on gaming ¹ 
    Time spent on productive thinking † 
    Time spent on unproductive thinking 


Baseline 1 (Linear regression) 
    Frequency of gaming ¹ † 
    Frequency of help requests † 
    Chapter difficulty † 
    Student decoding score † 
    Time spent on sentence 


Baseline 2 (Piecewise regression) 
    Frequency of gaming ¹ † 
    Frequency of help requests 
    Chapter difficulty † 
    Student decoding score † 
    Time spent on sentence † (as the non-linear parameter) 


Table 4: Comparison of evaluation metrics for proposed model and baselines on the first (Spanish) and second 
(Mandarin) dataset 


Dataset    Model                               RMSE          R²            AIC                BIC 
Spanish    Decomposed RT (Linear regression)   .267 (.027)   .220 (.013)   80.242 (14.441)    124.988 (14.446) 
Spanish    Baseline 1 (Linear regression)      .275 (.034)   .160 (.013)   122.399 (16.785)   153.721 (16.796) 
Spanish    Baseline 2 (Piecewise regression)   .273 (.031)   .193 (.012)   100.868 (16.259)   141.139 (16.276) 
Mandarin   Decomposed RT (Linear regression)   .268 (.034)   .131 (.014)   58.350 (10.521)    91.027 (10.521) 
Mandarin   Baseline 1 (Linear regression)      .274 (.034)   .093 (.015)   73.090 (10.301)    97.599 (10.301) 
Mandarin   Baseline 2 (Piecewise regression)   .271 (.033)   .133 (.012)   57.508 (10.067)    90.185 (10.068) 


For the second dataset, while decomposing response time, 
we made two adjustments. Firstly, since the students in this 
dataset were older (from 7th to 9th grade), their reading 
rates were adjusted for their grade level when calculating 
reading time. Secondly, the version of EMBRACE that was 
used to collect this data was not tracking when the stu- 
dents were pressing the “Next” button when it was disabled. 
Therefore, we discarded gaming time from our model. The 
remaining subcategories are calculated the same way as we 
did in the first dataset. The improvement in RMSE was 
significant against both Baseline 1 (p < 0.001) and Baseline 
2 (p < 0.005). The improvement in AIC and BIC was sig- 
nificant against Baseline 1 (p < 0.001) while Baseline 2 was 
significantly better than Decomposed RT (p < 0.05). 


Overall, Decomposed RT outperformed both baselines both 
in prediction error and the model fit criteria in the first 
dataset. However, in the second dataset, although we see 
an improvement in prediction errors in favor of Decomposed 
RT, Baseline 2 had significantly better AIC and BIC values 
than Decomposed RT. 


6. CONCLUSION 


Within this paper, we proposed a new methodology to 
decompose response time so that time spent on gaming the 
system, productive thinking, and unproductive thinking states 
within a single response time can be accounted for. Results 
showed that using the time spent on these states as separate 
predictors rather than using response time as a whole 
gave better predictions of student performance. Comparison 
against another baseline that employs response time as 
a non-linear predictor also revealed that the improvement 
was not due to addressing the non-linearity in response time, 
and that using the decomposition of response time to explain 
how much time was spent on different cognitive states indeed 
yielded better predictions. Moreover, comparison of AIC 
and BIC values supported that the improvement in the 
predictions was not due to introducing more predictors. 
However, we could not observe the same results for AIC and BIC 
between the proposed model and the non-linear baseline on 
the Mandarin dataset. A possible explanation is that we 
were not able to estimate the time spent on gaming in this 
dataset, thus time estimates for the other states were not as 
accurate as in the first dataset. 


¹ This measure is available only in the first dataset (Spanish). 


There are several other limitations of the work that need to 
be noted. Firstly, our estimation of reading time might not 
be the most accurate as there may be more factors influ- 
encing reading time than we addressed such as frequency of 
words and familiarity with the topic. Secondly, our model 
does not distinguish between the unproductive thinking be- 
haviors (mind wandering and wheel-spinning) in its current 
stage. We plan to further explore how we can capture these 
different kinds of unproductive thinking. 


In conclusion, we proposed a new method to use response 
time as a predictor in student modeling. The results show a 
promising improvement in predictive models of student per- 
formance when response time is decomposed into subcate- 
gories that can be indicative of the possible cognitive states 
students engage in. Future work should further assess this 
method’s generalizability to different student profiles and 
different domains. 